Read *.doc

Agamer

(Posted 2003) [#1]

Is it possible to read a Word document file into Blitz and get the text it contains? I know how to open and read files, but I wonder how to do it with a Word file!

CS_TBL

(Posted 2003) [#2]

hm.. iirc there's no LoadDoc(file$,flags) command yet :)

I guess you have to find out how the Word file format works and write your own doc importer.

ford escort

(Posted 2003) [#3]

You can find more info about the .doc file format at Wotsit's Format. Hope this helps you :)

Hansie

(Posted 2003) [#4]

An alternative is to convert the word doc into a plain textfile (.txt), which removes all the header info a word doc contains ...

CS_TBL

(Posted 2003) [#5]

But then again, I wonder why Agamer didn't use Notepad or so in the first place. If he really wants those bold/italic/underlined/font/color things, then converting to .txt doesn't help much here.

Perturbatio

(Posted 2003) [#6]

RTF is an alternative (and easier to code since it's just tags).
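
To illustrate the "just tags" point, here is a rough, untested sketch of pulling the plain text out of an RTF string in Blitz. RtfToText$, IsLetter and IsDigit are made-up helper names, not built-in commands, and the list of skipped groups (fonttbl, colortbl, stylesheet, info) is an assumption about what a typical file contains. It drops hex escapes such as \'e9 and knows nothing about Unicode, so treat it as a starting point only.

```blitzbasic
; Sketch only: IsLetter, IsDigit and RtfToText$ are hypothetical helpers.
Function IsLetter(ch$)
	If Len(ch$) <> 1 Then Return False
	Local a = Asc(Lower$(ch$))
	Return (a >= 97 And a <= 122)
End Function

Function IsDigit(ch$)
	If Len(ch$) <> 1 Then Return False
	Local a = Asc(ch$)
	Return (a >= 48 And a <= 57)
End Function

Function RtfToText$(rtf$)
	Local result$ = ""
	Local n = Len(rtf$)
	Local i = 1
	Local depth = 0       ; current { } nesting level
	Local skipDepth = 0   ; level of a group being ignored (0 = none)
	Local c$, d$, ctrl$

	While i <= n
		c$ = Mid$(rtf$, i, 1)
		If c$ = "{" Then
			depth = depth + 1
			i = i + 1
		ElseIf c$ = "}" Then
			If skipDepth = depth Then skipDepth = 0
			depth = depth - 1
			i = i + 1
		ElseIf c$ = Chr$(92) Then
			; Chr$(92) is the backslash that starts every RTF tag
			d$ = Mid$(rtf$, i + 1, 1)
			If IsLetter(d$) Then
				; control word: letters, optional number, optional single space
				ctrl$ = ""
				i = i + 1
				While IsLetter(Mid$(rtf$, i, 1))
					ctrl$ = ctrl$ + Mid$(rtf$, i, 1)
					i = i + 1
				Wend
				If Mid$(rtf$, i, 1) = "-" Then i = i + 1
				While IsDigit(Mid$(rtf$, i, 1))
					i = i + 1
				Wend
				If Mid$(rtf$, i, 1) = " " Then i = i + 1
				If skipDepth = 0 Then
					If ctrl$ = "par" Or ctrl$ = "line" Then
						result$ = result$ + Chr$(13) + Chr$(10)
					ElseIf ctrl$ = "tab" Then
						result$ = result$ + Chr$(9)
					ElseIf ctrl$ = "fonttbl" Or ctrl$ = "colortbl" Or ctrl$ = "stylesheet" Or ctrl$ = "info" Then
						; ASSUMPTION: these groups hold no document text
						skipDepth = depth
					EndIf
				EndIf
			ElseIf d$ = "'" Then
				i = i + 4   ; \'hh hex escape: dropped in this sketch
			ElseIf d$ = "*" Then
				; \* marks a group that readers are allowed to ignore
				If skipDepth = 0 Then skipDepth = depth
				i = i + 2
			Else
				; escaped literal such as \\ \{ \}
				If skipDepth = 0 Then result$ = result$ + d$
				i = i + 2
			EndIf
		ElseIf c$ = Chr$(13) Or c$ = Chr$(10) Then
			i = i + 1   ; raw line breaks in the file are not part of the text
		Else
			If skipDepth = 0 Then result$ = result$ + c$
			i = i + 1
		EndIf
	Wend

	Return result$
End Function
```

Called as txt$ = RtfToText$(rtf$) on a string holding the whole file, it should hand back the text with a CR/LF pair wherever the RTF had \par or \line.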

aCiD2

(Posted 2003) [#7]

RTF is tags? Nice :) That's gonna come in handy, hehe.

Agamer

(Posted 2003) [#8]

Yeah, I am using it in a program I'm writing at the moment. It already uses .txt, but it still needs to be able to read .doc.

xlsior

(Posted 2003) [#9]

Part of the problem with .doc is that Microsoft never fully released the specifications for Word files, which is why third-party word processors like OpenOffice, WordPerfect, etc. all have minor problems with certain documents.

It is probably a lot easier to use RTF, which is what Microsoft used with Write/WordPad and MS Word 2.0.
RTF is a much simpler markup format, and its documentation is much more complete.

Anyway, for pretty much any file format description this is the place to go: http://www.wotsit.org
Hundreds upon hundreds of file format documents can be found there. Great resource.

Agamer

(Posted 2003) [#10]

Thanks, but I can't find the file format for MS Word 2000 and above.

eBusiness

(Posted 2003) [#11]

Huh, that's a lot of file formats. Why didn't anybody tell me about that site before I started cracking various formats? Anyway, pretty useful :)

xlsior

(Posted 2003) [#12]

To my knowledge, Microsoft never released the file format for Word 2000/2003.
They don't want people to be able to open those files with different programs; they want those people to buy Word as well.

Microsoft simply has too much to lose if other programs like OpenOffice can read/write word documents flawlessly.

OpenOffice is free. Would you buy Microsoft Office for $$extortion$$ if you could get a completely free, legal alternative that can do the same thing? No, and Microsoft knows that too, hence they simply don't release the specifications for their document formats anymore.

Any info on Word 2000/2003 you'll find has been obtained by people painstakingly trying to reverse engineer the document format... and it is still not perfect.

Bottom line: I don't think that a Word 2000/2003 document viewer in Blitz is going to be a realistic expectation... or in *any* language any time soon, for that matter.

Agamer

(Posted 2004) [#13]

Oh, I don't want to be able to view it or retain the font/bold/italic/size settings. All I want to do is read the text. Someone must have done this before. My program can already import Notepad text, but it would be nice to import documents.

Andy_A

(Posted 2004) [#14]

If you just want the text from a Word doc, try this:
(it's easier to show you the code than to type in the explanation)

Graphics 800,600

SetBuffer BackBuffer()

; Open the file to read
filein% = ReadFile("C:\My Documents\Blitz test.doc")
If filein = 0 Then RuntimeError "Could not open the Word document"
; (The test document is just this code, copied and pasted into Word)

; Loop this until we reach the end of file 
While Not Eof(filein)
 
	GetByte% = ReadByte( filein )
	;count the bytes as you read them in
	count = count + 1

	;The Word doc header is 1,536 characters, so just skip past them
	;BTW, the header length is the same for both Word 97 and Word 2000
	If count > 1536 Then

		;Chr$(13) is the carriage return that ends a line, so print the current line and reset "Word$" to empty
		If GetByte = 13 Then
			Text 0,spacing,Word$
			spacing = spacing + 15
			Word$ = ""
		End If

		;If valid ASCII character then continue adding to Word$
		If GetByte > 31 And GetByte < 128 Then
			Word$ = Word$ + Chr$(GetByte)
		End If

	End If

Wend

Flip

WaitKey()

CloseFile filein

End

The straggler characters at the end can be ignored, or I'll leave it to you to figure out the rest (cuz I don't know how).

As you can see in the code, the header data is skipped and any valid ASCII character is concatenated to the "Word$" variable. I haven't checked to see if the length of the "footer" data is the same length for every document, but this may be a way to eliminate the stray characters at the end of the text.
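
One possible way to attack the stray characters, offered as a guess rather than a verified fix: jump straight past the header with SeekFile and stop reading once a run of zero bytes shows up. ReadDocText$ and HEADER_SIZE are made-up names, the 1,536-byte offset is simply the figure from the code above, and the idea that two zero bytes in a row mark the end of the text is purely an assumption about how the file is padded, not something taken from any format document. It has not been tested against real documents.

```blitzbasic
Const HEADER_SIZE = 1536   ; header length quoted in the code above

; ReadDocText$ is a hypothetical helper, not a Blitz command.
Function ReadDocText$(path$)
	Local result$ = ""
	Local file, b
	Local zeros = 0

	; Nothing to read if the file is missing or no longer than the header
	If FileSize(path$) <= HEADER_SIZE Then Return ""
	file = ReadFile(path$)
	If file = 0 Then Return ""

	SeekFile file, HEADER_SIZE   ; skip the header instead of counting bytes

	While Not Eof(file)
		b = ReadByte(file)
		If b = 0 Then
			zeros = zeros + 1
			; ASSUMPTION: two zero bytes in a row mean the text has ended
			If zeros >= 2 Then Exit
		Else
			zeros = 0
			If b = 13 Then
				result$ = result$ + Chr$(13) + Chr$(10)
			ElseIf b > 31 And b < 128 Then
				result$ = result$ + Chr$(b)
			EndIf
		EndIf
	Wend

	CloseFile file
	Return result$
End Function
```

If the guess is wrong for a given document, the function either cuts the text short or still returns some junk, so compare its result with what Word shows before relying on it.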

Andy

Agamer

(Posted 2004) [#15]

With this I still get 3 lines of jumble afterwards.

Andy_A

(Posted 2004) [#16]

Yeah I know, that's what I said in my previous post.

Let me think about it a while. (you can too!)

There's bound to be a way around this.